graph LR
A["Plain Text<br/>(Markdown, TXT)"] --> B["Simple PDF<br/>(Text-only)"]
B --> C["Semi-Structured<br/>(Text + Tables)"]
C --> D["Multi-Modal<br/>(Text + Tables + Images)"]
D --> E["Scanned/Complex<br/>(OCR + Layout)"]
style A fill:#27ae60,color:#fff,stroke:#333
style B fill:#4a90d9,color:#fff,stroke:#333
style C fill:#f5a623,color:#fff,stroke:#333
style D fill:#e67e22,color:#fff,stroke:#333
style E fill:#e74c3c,color:#fff,stroke:#333
Retrieval over Images, Tables, and PDFs
Indexing and retrieving from complex documents with vision-language models, multi-vector retrieval, and LlamaParse
Keywords: multimodal RAG, LlamaParse, ColPali, multi-vector retriever, vision-language model, PDF parsing, table extraction, image retrieval, Unstructured, GPT-4o, document understanding, OCR, layout detection, semi-structured data, LlamaIndex, LangChain

Introduction
Most RAG tutorials assume your documents are clean text. In reality, the documents that matter most — financial reports, research papers, technical manuals, slide decks, medical records — are visually rich PDFs packed with tables, charts, diagrams, and images that carry critical information.
Standard text-based RAG fails on these documents in predictable ways:
- Tables get flattened into meaningless strings when extracted as raw text
- Charts and diagrams are invisible to text-only pipelines — they’re simply discarded
- Page layouts with multi-column formatting, sidebars, and footnotes produce garbled text
- Scanned documents yield nothing without OCR, and OCR introduces errors
The gap is stark. LangChain’s benchmark on investor slide decks showed that text-only RAG scored 20% accuracy on questions about visual content, while multimodal approaches reached 60–90%. The information is there — it’s just locked in visual formats that text pipelines can’t see.
This article covers the full spectrum of solutions: from intelligent document parsing (LlamaParse, Unstructured) to multi-vector retrieval strategies, vision-based document embeddings (ColPali), and end-to-end multimodal RAG pipelines in LlamaIndex and LangChain.
The Problem: Why Text Extraction Breaks
What Gets Lost
Consider a typical financial report PDF. A standard text extraction pipeline (PyPDF, pdfplumber) produces output like:
Revenue Q1 Q2 Q3 Q4
Product A 12.3 14.1 15.8 18.2
Product B 8.7 9.2 10.1 11.5
Total 21.0 23.3 25.9 29.7
If you’re lucky. More often you get:
Revenue Q1 Q2 Q3 Q4 Product A 12.3 14.1 15.8 18.2 Product B 8.7 9.2
10.1 11.5 Total 21.0 23.3 25.9 29.7
Or worse — columns merged, rows split, headers detached from data. When this garbage gets chunked and embedded, the resulting vectors are meaningless. A query like “What was Product A revenue in Q3?” retrieves chunks that contain the right numbers but in the wrong structure, leading to hallucinated answers.
The Document Complexity Spectrum
| Document Type | Example | Text Extraction Quality | Solution |
|---|---|---|---|
| Plain text | Markdown, code | Perfect | Standard RAG |
| Simple PDF | Text-only reports | Good | PyPDF / pdfplumber |
| Semi-structured | Tables + text | Poor for tables | Unstructured / LlamaParse |
| Multi-modal | Charts, diagrams, photos | Tables degraded, images lost | Multi-vector retriever + VLM |
| Scanned | Paper scans, old docs | Nothing without OCR | OCR + layout detection |
Approach 1: Intelligent Document Parsing
The first strategy is to extract structure faithfully before embedding. Instead of treating PDFs as flat text, use parsers that understand document layout.
LlamaParse
LlamaParse is LlamaIndex’s document parsing service that uses vision-language models to understand page layout and extract structured content — including tables rendered as proper Markdown, image descriptions, and hierarchical sections.
from llama_cloud import AsyncLlamaCloud

client = AsyncLlamaCloud(api_key="llx-...")

# Upload and parse a document
file_obj = await client.files.create(
    file="./quarterly_report.pdf",
    purpose="parse",
)
result = await client.parsing.parse(
    file_id=file_obj.id,
    tier="agentic",  # highest quality: uses a VLM for layout understanding
    version="latest",
    output_options={
        "markdown": {
            "tables": {
                "output_tables_as_markdown": True,  # tables as Markdown tables
            },
        },
        "images_to_save": ["screenshot"],  # save page screenshots
    },
    expand=["text", "markdown", "items", "images_content_metadata"],
)

# Access structured markdown output
for page in result.markdown.pages:
    print(page.markdown)

# Access extracted tables programmatically
for page in result.items.pages:
    for item in page.items:
        if hasattr(item, "rows"):  # table item
            print(f"Table on page {page.page_number}: "
                  f"{len(item.rows)} rows")
LlamaParse tiers:
| Tier | Method | Best For | Cost |
|---|---|---|---|
| Fast | Rule-based extraction | Simple text-only PDFs | Lowest |
| Standard | Layout detection + OCR | Semi-structured documents | Medium |
| Agentic | Vision-language model | Complex layouts, figures, tables | Highest |
Integration with LlamaIndex RAG:
from llama_index.core import VectorStoreIndex, Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
# LlamaParse returns Documents with rich markdown
# Tables are preserved as proper Markdown tables
# Images get text descriptions
index = VectorStoreIndex.from_documents(
    parsed_documents,  # from LlamaParse
    show_progress=True,
)
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("What was Product A revenue in Q3?")
Unstructured
Unstructured is an open-source library that partitions documents into typed elements — text blocks, tables, images, headers — using layout detection models.
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf(
    filename="./quarterly_report.pdf",
    strategy="hi_res",           # uses a layout detection model (YOLOX)
    infer_table_structure=True,  # extract table structure
    extract_images_in_pdf=True,  # extract embedded images
    extract_image_block_output_dir="./extracted_images",
)

# Elements are typed: NarrativeText, Table, Image, Title, etc.
tables = [el for el in elements if el.category == "Table"]
texts = [el for el in elements if el.category == "NarrativeText"]
images = [el for el in elements if el.category == "Image"]
print(f"Found {len(tables)} tables, {len(texts)} text blocks, "
      f"{len(images)} images")

# Tables include an HTML representation
for table in tables:
    print(table.metadata.text_as_html)  # <table><tr><td>...
How Unstructured partitions a PDF:
graph TD
A["PDF Document"] --> B["Remove Embedded<br/>Image Blocks"]
B --> C["YOLOX Layout<br/>Detection"]
C --> D["Bounding Boxes:<br/>Tables, Titles, Text"]
D --> E["Extract Table<br/>Structure (HTML)"]
D --> F["Extract Section<br/>Titles"]
D --> G["Extract Text<br/>Blocks"]
D --> H["Extract Images"]
E --> I["Typed Elements<br/>with Metadata"]
F --> I
G --> I
H --> I
style A fill:#4a90d9,color:#fff,stroke:#333
style B fill:#f5a623,color:#fff,stroke:#333
style C fill:#9b59b6,color:#fff,stroke:#333
style D fill:#e67e22,color:#fff,stroke:#333
style E fill:#27ae60,color:#fff,stroke:#333
style F fill:#27ae60,color:#fff,stroke:#333
style G fill:#27ae60,color:#fff,stroke:#333
style H fill:#27ae60,color:#fff,stroke:#333
style I fill:#1abc9c,color:#fff,stroke:#333
Comparing Document Parsers
| Parser | Open Source | Tables | Images | OCR | Layout Detection | Best For |
|---|---|---|---|---|---|---|
| PyPDF | Yes | Poor | No | No | No | Simple text PDFs |
| pdfplumber | Yes | Good (rule-based) | No | Basic | No | Tables with clear lines |
| Unstructured | Yes | Good (ML) | Yes | Yes | YOLOX | General-purpose, self-hosted |
| LlamaParse | API | Excellent (VLM) | Yes | Yes | VLM-based | Complex layouts, highest quality |
| Docling (IBM) | Yes | Good | Yes | Yes | DocLayNet | Enterprise, structured output |
| Surya | Yes | Good | No | Yes | Layout model | OCR-focused, multilingual |
Approach 2: Multi-Vector Retrieval
Even with good parsing, a fundamental mismatch remains: tables and images don’t embed well as text. A table of numbers produces a poor embedding because embedding models are trained on natural language, not structured data.
The multi-vector retriever pattern solves this by decoupling what you index from what you retrieve:
- Generate a natural language summary of each table/image (optimized for retrieval)
- Embed the summary (what you search against)
- Store the original table/image (what you pass to the LLM)
At query time, you match against summaries but feed raw content to the LLM.
graph TD
A["Document"] --> B["Parser<br/>(Unstructured / LlamaParse)"]
B --> C["Text Chunks"]
B --> D["Tables"]
B --> E["Images"]
C --> F["Embed Text"]
D --> G["LLM: Summarize Table"]
E --> H["VLM: Describe Image"]
G --> I["Embed Summary"]
H --> J["Embed Description"]
F --> K["Vector Store<br/>(Summaries + Embeddings)"]
I --> K
J --> K
C --> L["Doc Store<br/>(Raw Content)"]
D --> L
E --> L
M["Query"] --> K
K -->|"Retrieve matching<br/>summary IDs"| L
L -->|"Return raw content<br/>(text, table, image)"| N["LLM / VLM<br/>Generation"]
M --> N
N --> O["Answer"]
style A fill:#4a90d9,color:#fff,stroke:#333
style B fill:#f5a623,color:#fff,stroke:#333
style C fill:#27ae60,color:#fff,stroke:#333
style D fill:#e67e22,color:#fff,stroke:#333
style E fill:#9b59b6,color:#fff,stroke:#333
style G fill:#e67e22,color:#fff,stroke:#333
style H fill:#9b59b6,color:#fff,stroke:#333
style K fill:#C8CFEA,color:#fff,stroke:#333
style L fill:#C8CFEA,color:#fff,stroke:#333
style N fill:#e74c3c,color:#fff,stroke:#333
style O fill:#1abc9c,color:#fff,stroke:#333
LangChain: Multi-Vector Retriever for Tables
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.storage import InMemoryByteStore
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
import uuid
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Parse document into typed elements
# (assume tables and texts extracted via Unstructured)
table_elements = [...] # raw table HTML/markdown
text_elements = [...] # text blocks
# --- Step 1: Summarize tables ---
TABLE_SUMMARY_PROMPT = ChatPromptTemplate.from_template(
    "Summarize the following table in natural language. "
    "Describe what metrics it shows, key values, and trends.\n\n"
    "Table:\n{table}"
)
summarize_chain = TABLE_SUMMARY_PROMPT | llm | StrOutputParser()
table_summaries = [
    summarize_chain.invoke({"table": table}) for table in table_elements
]

# --- Step 2: Build multi-vector retriever ---
# FAISS cannot be initialized from an empty list, so seed it with a placeholder
vectorstore = FAISS.from_texts(["placeholder"], embeddings)
docstore = InMemoryByteStore()
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=docstore,
    id_key="doc_id",
)

# Add text chunks (summary = text itself)
text_ids = [str(uuid.uuid4()) for _ in text_elements]
retriever.vectorstore.add_documents(
    [Document(page_content=t, metadata={"doc_id": id_})
     for t, id_ in zip(text_elements, text_ids)]
)
# retriever.docstore wraps the byte store and expects Document values
retriever.docstore.mset(
    [(id_, Document(page_content=t))
     for t, id_ in zip(text_elements, text_ids)]
)

# Add table summaries (index summary, store raw table)
table_ids = [str(uuid.uuid4()) for _ in table_elements]
retriever.vectorstore.add_documents(
    [Document(page_content=summary, metadata={"doc_id": id_})
     for summary, id_ in zip(table_summaries, table_ids)]
)
retriever.docstore.mset(
    [(id_, Document(page_content=t))
     for t, id_ in zip(table_elements, table_ids)]
)

# --- Step 3: Query ---
# The retriever matches against summaries but returns the raw content
docs = retriever.invoke("What was Product A revenue in Q3?")
# docs contains the RAW table, not the summary
LlamaIndex: Multi-Modal Index with Summaries
from llama_index.core import VectorStoreIndex, Settings
from llama_index.core.schema import TextNode
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
llm = Settings.llm

# Summarize tables for better embedding
def summarize_table(table_text: str) -> str:
    response = llm.complete(
        f"Summarize this table concisely for retrieval:\n{table_text}"
    )
    return str(response)

# Create nodes with summary embeddings but raw table content
nodes = []

# Text nodes (embed directly)
for text_chunk in text_chunks:
    nodes.append(TextNode(text=text_chunk))

# Table nodes (embed summary, store raw for generation)
for table in tables:
    summary = summarize_table(table)
    node = TextNode(
        text=summary,  # embedded for retrieval
        metadata={"raw_table": table, "type": "table"},
    )
    nodes.append(node)

# Build index
index = VectorStoreIndex(nodes, show_progress=True)

# Query engine; retrieved nodes carry the raw tables in metadata
query_engine = index.as_query_engine(
    similarity_top_k=5,
    response_mode="compact",
)
Approach 3: Vision-Based Document Retrieval
The Problem with Text-First Pipelines
Even the best document parsers follow a fundamentally fragile pipeline:
- OCR on scanned pages
- Layout detection to segment elements
- Structure reconstruction and reading order
- Specialized models to caption figures and tables
- Chunking
- Text embedding
Each step can introduce errors that propagate downstream. ColPali (Faysse et al., 2024) challenges this entirely: skip text extraction and embed the page image directly.
ColPali: Embed the Page Image
ColPali uses a Vision Language Model (PaliGemma) to produce multi-vector embeddings from page images. Instead of extracting text and embedding it, ColPali:
- Takes a screenshot of each document page
- Splits it into visual patches via a vision transformer (SigLIP)
- Projects patch embeddings through a language model (Gemma) for contextualization
- Produces a multi-vector representation (one vector per patch)
- Uses ColBERT-style late interaction to match query tokens against document patches
graph TD
subgraph IDX["Indexing"]
A["PDF Page<br/>(Image)"] --> B["Vision Transformer<br/>(SigLIP)"]
B --> C["Patch Embeddings"]
C --> D["Language Model<br/>(Gemma)"]
D --> E["Contextualized<br/>Patch Vectors<br/>[N × 128 dims]"]
end
subgraph QRY["Querying"]
F["User Query"] --> G["Language Model<br/>(Gemma)"]
G --> H["Token Embeddings<br/>[M × 128 dims]"]
end
E --> I["Late Interaction<br/>(MaxSim per query token)"]
H --> I
I --> J["Relevance Score"]
style A fill:#4a90d9,color:#fff,stroke:#333
style B fill:#9b59b6,color:#fff,stroke:#333
style C fill:#e67e22,color:#fff,stroke:#333
style D fill:#C8CFEA,color:#fff,stroke:#333
style E fill:#27ae60,color:#fff,stroke:#333
style F fill:#4a90d9,color:#fff,stroke:#333
style G fill:#C8CFEA,color:#fff,stroke:#333
style H fill:#27ae60,color:#fff,stroke:#333
style I fill:#e74c3c,color:#fff,stroke:#333
style J fill:#1abc9c,color:#fff,stroke:#333
style IDX fill:#F2F2F2,stroke:#D9D9D9
style QRY fill:#F2F2F2,stroke:#D9D9D9
Key insight: The late interaction mechanism means that for each query token, ColPali finds the most relevant visual patch on the page. This naturally handles tables (the patch containing “Q3, 15.8” will match a query about Q3 revenue), charts (axis labels and data points are visual patches), and mixed content.
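The late-interaction scoring itself is compact: for each query token vector, take the maximum similarity over all page patch vectors, then sum across query tokens. A minimal NumPy sketch of MaxSim (toy dimensions and random vectors for illustration; `maxsim_score` is not part of the colpali_engine API, which provides this via `processor.score_multi_vector`):

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, page_vecs: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token, keep the
    best-matching page patch (dot product), then sum over query tokens."""
    sim = query_vecs @ page_vecs.T       # [M, N] pairwise similarities
    return float(sim.max(axis=1).sum())  # best patch per token, summed

# Toy example: 3 query-token vectors, two pages with 4 patch vectors each
rng = np.random.default_rng(0)
query = rng.normal(size=(3, 8))
pages = [rng.normal(size=(4, 8)) for _ in range(2)]
scores = [maxsim_score(query, p) for p in pages]
best_page = int(np.argmax(scores))
```

Because the max is taken per query token, a query about "Q3 revenue" scores highly as long as some patch on the page matches each token, which is exactly why tables and chart labels are retrievable without OCR.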
Using ColPali
from colpali_engine.models import ColPali, ColPaliProcessor
import torch
from PIL import Image

# Load model
model = ColPali.from_pretrained(
    "vidore/colpali-v1.3",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
).eval()
processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.3")

# Index: embed page images (inputs must be moved to the model's device)
page_images = [Image.open(f"page_{i}.png") for i in range(num_pages)]
batch = processor.process_images(page_images).to(model.device)
with torch.no_grad():
    page_embeddings = model(**batch)  # multi-vector embeddings, one per page

# Query: embed the question
query = "What was the revenue growth in Q3?"
query_batch = processor.process_queries([query]).to(model.device)
with torch.no_grad():
    query_embedding = model(**query_batch)

# Score via late interaction (MaxSim)
scores = processor.score_multi_vector(query_embedding, page_embeddings)
top_page_idx = scores[0].argmax().item()
print(f"Most relevant page: {top_page_idx}")
ColPali + Multimodal LLM for Full RAG
Once ColPali retrieves the right page(s), feed the page image to a multimodal LLM for answer generation:
import base64
import io

from openai import OpenAI

client = OpenAI()

# Retrieve top page with ColPali (as above)
retrieved_page_image = page_images[top_page_idx]

# Convert to base64
buffer = io.BytesIO()
retrieved_page_image.save(buffer, format="PNG")
image_b64 = base64.b64encode(buffer.getvalue()).decode()

# Generate answer from the page image
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "Based on this document page, answer the following "
                        "question. Only use information visible on the page.\n\n"
                        f"Question: {query}"
                    ),
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{image_b64}",
                    },
                },
            ],
        }
    ],
    max_tokens=500,
)
print(response.choices[0].message.content)
ColPali vs. Text-Based Retrieval
On the ViDoRe benchmark (Visual Document Retrieval), ColPali outperforms all text-based pipelines — including those using expensive captioning with Claude Sonnet:
| Method | Pipeline Complexity | ViDoRe Score | Handles Visuals |
|---|---|---|---|
| BGE-M3 (text only) | OCR → chunk → embed | Baseline | No |
| BGE-M3 + Captioning | OCR → caption figures → chunk → embed | Better | Partial |
| Claude Sonnet Captioning | VLM caption everything → embed | Good | Yes (expensive) |
| ColPali | Screenshot → embed image | Best | Yes (native) |
Approach 4: Multimodal Embeddings
Instead of embedding text summaries, embed images and text in the same vector space using multimodal embedding models.
OpenCLIP Embeddings
from langchain_experimental.open_clip import OpenCLIPEmbeddings

# Embeds both text and images into the same vector space
embeddings = OpenCLIPEmbeddings(
    model_name="ViT-H-14",
    checkpoint="laion2b_s32b_b79k",
)

# Embed text
text_vectors = embeddings.embed_documents(["Revenue grew 15% in Q3"])

# Embed images (by file path)
image_vectors = embeddings.embed_image(
    ["./chart_revenue.png", "./table_quarterly.png"]
)
# Both live in the same vector space, so they can be searched together
LangChain multimodal RAG with Chroma:
from langchain_chroma import Chroma
from langchain_experimental.open_clip import OpenCLIPEmbeddings

embeddings = OpenCLIPEmbeddings()

# Build a vector store holding both text and images
vectorstore = Chroma(
    collection_name="multimodal_docs",
    embedding_function=embeddings,
)

# Add text and image embeddings to the same collection
vectorstore.add_texts(texts=text_chunks)
vectorstore.add_images(uris=image_paths)

# A query retrieves both text and images by similarity
results = vectorstore.similarity_search("quarterly revenue chart", k=5)
Trade-offs: Multimodal Embeddings vs. Summarization
| Approach | Pros | Cons |
|---|---|---|
| Multimodal embeddings (OpenCLIP) | Simple pipeline, same space for text + images | Limited model options, struggles with visually similar content |
| Summarize + text embed | Mature text embedding models, detailed descriptions | Higher complexity, cost of pre-computing summaries |
| ColPali (vision multi-vector) | Best accuracy, simplest pipeline, no text extraction | Higher storage (multi-vector), newer ecosystem |
LangChain’s benchmark on slide decks showed the performance gap clearly:
| Approach | Accuracy |
|---|---|
| Text-only RAG | 20% |
| Multimodal embeddings (OpenCLIP) | 60% |
| Multi-vector retriever (image summaries) | 90% |
Handling Tables Specifically
Tables are the most common semi-structured element and deserve focused attention.
Strategy 1: Preserve Markdown Tables
With LlamaParse or good parsing, tables become proper Markdown:
| Quarter | Product A | Product B | Total |
|---------|-----------|-----------|-------|
| Q1 | 12.3 | 8.7 | 21.0 |
| Q2 | 14.1 | 9.2 | 23.3 |
| Q3 | 15.8 | 10.1 | 25.9 |
| Q4 | 18.2 | 11.5 | 29.7 |
This embeds reasonably well and preserves structure for the LLM to read.
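When a parser emits a table as rows of cells (or HTML you have already parsed into rows) rather than Markdown, the conversion is mechanical. A small self-contained sketch (`rows_to_markdown` is a hypothetical helper, not part of any parser library):

```python
def rows_to_markdown(rows: list[list[str]]) -> str:
    """Render a table (first row = header) as a Markdown table."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join("---" for _ in header) + "|",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)

table = [
    ["Quarter", "Product A", "Product B"],
    ["Q1", "12.3", "8.7"],
    ["Q2", "14.1", "9.2"],
]
print(rows_to_markdown(table))
```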
Strategy 2: Table Summarization for Retrieval
Generate a natural language summary for each table, embed the summary, but pass the raw table to the LLM:
TABLE_SUMMARY_PROMPT = """Describe this table for a search index.
Include: what metrics are shown, the time period, key values,
notable trends, and any relationships between columns.

Table:
{table}

Summary:"""
Strategy 3: Table-Specific Query Engine
For documents with many tables, create a dedicated table retriever:
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode

# Create nodes specifically from table summaries
table_nodes = []
for i, (table, summary) in enumerate(zip(raw_tables, table_summaries)):
    node = TextNode(
        text=summary,
        metadata={
            "raw_table": table,
            "table_index": i,
            "type": "table",
        },
    )
    table_nodes.append(node)

# Separate index for tables
table_index = VectorStoreIndex(table_nodes)
table_engine = table_index.as_query_engine(similarity_top_k=3)
This can be combined with the agentic approach from Agentic RAG: When Retrieval Needs Reasoning, where an agent routes table-specific questions to the table retriever.
End-to-End Pipeline: Multimodal RAG
LlamaIndex: Parse + Index + Query
from llama_cloud import AsyncLlamaCloud
from llama_index.core import VectorStoreIndex, Settings, Document
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)

# --- Step 1: Parse with LlamaParse ---
client = AsyncLlamaCloud(api_key="llx-...")
file_obj = await client.files.create(
    file="./report.pdf", purpose="parse"
)
result = await client.parsing.parse(
    file_id=file_obj.id,
    tier="agentic",
    output_options={
        "markdown": {"tables": {"output_tables_as_markdown": True}},
    },
    expand=["markdown"],
)

# Convert parsed pages to Documents
documents = []
for page in result.markdown.pages:
    documents.append(Document(
        text=page.markdown,
        metadata={"page_number": page.page_number},
    ))

# --- Step 2: Index ---
index = VectorStoreIndex.from_documents(
    documents, show_progress=True
)

# --- Step 3: Query ---
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("What was the YoY revenue growth?")
print(response)
LangChain: Unstructured + Multi-Vector + GPT-4o
from unstructured.partition.pdf import partition_pdf
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain.storage import InMemoryByteStore
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
import uuid
# --- Step 1: Parse ---
elements = partition_pdf(
    filename="./report.pdf",
    strategy="hi_res",
    infer_table_structure=True,
    extract_images_in_pdf=True,
    extract_image_block_output_dir="./images",
)
tables = [el for el in elements if el.category == "Table"]
texts = [el for el in elements if el.category == "NarrativeText"]
# --- Step 2: Summarize tables ---
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
summarize_prompt = ChatPromptTemplate.from_template(
    "Summarize this table. Describe what it shows, key values, and trends.\n"
    "Table:\n{table}"
)
summarize_chain = summarize_prompt | llm | StrOutputParser()
table_summaries = [
    summarize_chain.invoke({"table": t.metadata.text_as_html})
    for t in tables
]
# --- Step 3: Build multi-vector retriever ---
# FAISS cannot be initialized empty, so seed it with a placeholder
vectorstore = FAISS.from_texts(["placeholder"], embeddings)
docstore = InMemoryByteStore()
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=docstore,
    id_key="doc_id",
)

# Add text elements
for text_el in texts:
    doc_id = str(uuid.uuid4())
    retriever.vectorstore.add_documents([
        Document(page_content=str(text_el), metadata={"doc_id": doc_id})
    ])
    # retriever.docstore wraps the byte store and expects Document values
    retriever.docstore.mset([(doc_id, Document(page_content=str(text_el)))])

# Add tables (index summary, store raw)
for summary, table_el in zip(table_summaries, tables):
    doc_id = str(uuid.uuid4())
    retriever.vectorstore.add_documents([
        Document(page_content=summary, metadata={"doc_id": doc_id})
    ])
    raw = table_el.metadata.text_as_html or str(table_el)
    retriever.docstore.mset([(doc_id, Document(page_content=raw))])
# --- Step 4: RAG chain ---
prompt = ChatPromptTemplate.from_template(
    "Answer based on this context. Tables may be in HTML format.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    # Docstore values may come back as bytes or Document objects
    return "\n\n".join(
        d.decode() if isinstance(d, bytes) else d.page_content
        for d in docs
    )

rag_chain = (
    {
        "context": retriever | format_docs,
        "question": RunnablePassthrough(),
    }
    | prompt
    | llm
    | StrOutputParser()
)

answer = rag_chain.invoke("What was Product A revenue in Q3?")
print(answer)
Choosing the Right Approach
graph TD
A["What kind of documents?"] --> B{"Mostly text<br/>with some tables?"}
B -->|Yes| C["LlamaParse / Unstructured<br/>+ Standard RAG"]
B -->|No| D{"Charts, diagrams,<br/>images matter?"}
D -->|No, tables only| E["Multi-Vector Retriever<br/>(Table summaries)"]
D -->|Yes| F{"Need page-level<br/>retrieval?"}
F -->|Yes| G["ColPali +<br/>Multimodal LLM"]
F -->|No| H["Multi-Vector Retriever<br/>+ VLM Summarization"]
style A fill:#4a90d9,color:#fff,stroke:#333
style B fill:#f5a623,color:#fff,stroke:#333
style C fill:#27ae60,color:#fff,stroke:#333
style D fill:#f5a623,color:#fff,stroke:#333
style E fill:#27ae60,color:#fff,stroke:#333
style F fill:#f5a623,color:#fff,stroke:#333
style G fill:#9b59b6,color:#fff,stroke:#333
style H fill:#e67e22,color:#fff,stroke:#333
| Scenario | Recommended Approach | Why |
|---|---|---|
| Text-heavy PDFs with some tables | LlamaParse (agentic tier) → standard RAG | Good table extraction, minimal complexity |
| Financial reports with many tables | Multi-vector retriever with table summarization | Summaries improve retrieval; raw tables for accurate LLM answers |
| Slide decks and presentations | ColPali or multi-vector with VLM summaries | Visuals carry the information |
| Research papers (figures + equations) | LlamaParse + vision descriptions | Math and figures need specialized handling |
| Scanned legacy documents | Unstructured (hi_res) + OCR | Layout detection + OCR essential |
| Mixed corpus (all types) | Agent with multiple tools (text index, table index, image search) | Route queries to appropriate retriever |
Common Pitfalls
1. Treating All Content as Text
Problem: Flattening tables to text destroys structure. Charts become invisible.
Fix: Use a parser that preserves element types (Unstructured, LlamaParse). Handle each type differently — summarize tables, describe images, embed text.
2. Embedding Raw HTML Tables
Problem: Embedding raw <table><tr><td> HTML produces poor vectors because embedding models aren’t trained on HTML.
Fix: Summarize tables in natural language for the embedding step. Store raw HTML for the LLM generation step (LLMs read HTML well).
3. Ignoring Image Context
Problem: Extracting images from a document but not capturing surrounding text loses context (e.g., figure captions, section headers).
Fix: When extracting images, include adjacent text (captions, headers) in the metadata. Embed the combined text + caption.
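The pairing itself can be as simple as a pass over the parser's ordered element list. A sketch under the assumption that elements arrive as ordered `{'category', 'text'}` dicts (Unstructured emits similar typed categories, including `FigureCaption`); `attach_captions` is a hypothetical helper:

```python
def attach_captions(elements: list[dict]) -> list[dict]:
    """Pair each image element with the caption-like text that follows it,
    so the caption can be embedded alongside the image reference."""
    enriched = []
    for i, el in enumerate(elements):
        if el["category"] != "Image":
            continue
        caption = ""
        nxt = elements[i + 1] if i + 1 < len(elements) else None
        if nxt and nxt["category"] == "FigureCaption":
            caption = nxt["text"]
        enriched.append({**el, "caption": caption})
    return enriched

elements = [
    {"category": "Title", "text": "Q3 Results"},
    {"category": "Image", "text": "chart_revenue.png"},
    {"category": "FigureCaption", "text": "Figure 2: Revenue by quarter"},
]
images = attach_captions(elements)
```

Embedding the combined string (image description plus caption) then carries the document context into retrieval.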
4. Using VLMs for Everything
Problem: Running GPT-4o on every page image is slow and expensive.
Fix: Use a tiered approach — fast text extraction for simple pages, VLM only for complex layouts. LlamaParse tiers handle this automatically.
5. Not Evaluating Retrieval Separately
Problem: End-to-end evaluation hides whether the bottleneck is parsing, retrieval, or generation.
Fix: Evaluate each step independently. Check: (a) does the parser extract the table correctly? (b) does retrieval return the right element? (c) does the LLM read the element correctly?
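Step (b) in particular is cheap to measure in isolation: label a handful of queries with the element ID that should be retrieved and compute a hit rate. A minimal sketch (`retrieval_hit_rate` and the sample IDs are illustrative, not from any evaluation library):

```python
def retrieval_hit_rate(results: dict[str, list[str]],
                       expected: dict[str, str]) -> float:
    """Fraction of queries whose labeled element ID appears among the
    retrieved IDs; this isolates retrieval from parsing and generation."""
    hits = sum(expected[q] in ids for q, ids in results.items())
    return hits / len(results)

results = {
    "Q3 revenue for Product A?": ["table_2", "text_14"],
    "What does Figure 2 show?": ["text_3", "text_7"],
}
expected = {
    "Q3 revenue for Product A?": "table_2",
    "What does Figure 2 show?": "img_2",
}
print(retrieval_hit_rate(results, expected))  # → 0.5
```

A low score here with a correct parse points squarely at the retrieval step (summaries, embeddings, or top-k), before any LLM is involved.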
Summary
| Concept | Key Takeaway |
|---|---|
| Text-only RAG limitation | Flattens tables, drops images, breaks on complex layouts |
| Intelligent parsing | LlamaParse and Unstructured extract typed elements (text, tables, images) |
| Multi-vector retrieval | Embed summaries for search, store raw content for generation |
| ColPali | Embed page images directly with vision multi-vectors — simplest, highest accuracy |
| Multimodal embeddings | CLIP/OpenCLIP put text and images in same space — simple but less accurate |
| Table handling | Summarize for retrieval, preserve structure (Markdown/HTML) for generation |
| Production choice | Start with LlamaParse + standard RAG; add multi-vector or ColPali where evaluation shows visual content matters |
The key principle: don’t throw information away. If a document communicates through tables, charts, and layout, your retrieval pipeline must preserve that information — either through faithful parsing or by directly embedding the visual representation.
For the foundational pipeline these approaches extend, see Building a RAG Pipeline from Scratch. For chunking strategies for parsed text, see Advanced Chunking Strategies for RAG. For selecting embedding models, see Embedding Models and Reranking for RAG. For graph-based approaches to structured document data, see GraphRAG: Knowledge Graphs Meet Retrieval-Augmented Generation. For building agents that route across text, table, and image retrievers, see Agentic RAG: When Retrieval Needs Reasoning.
References
- Faysse, Sibille, Wu et al., ColPali: Efficient Document Retrieval with Vision Language Models, ICLR 2025. arXiv:2407.01449
- LangChain Blog, Multi-Vector Retriever for RAG on tables, text, and images, 2023. Blog
- LangChain Blog, Multi-modal RAG on slide decks, 2023. Blog
- LlamaIndex Documentation, LlamaParse Getting Started, 2026. Docs
- Unstructured Documentation, Partitioning PDFs, 2026. Docs
- ViDoRe Leaderboard, Visual Document Retrieval Benchmark, HuggingFace, 2026. Leaderboard
Read More
- Evaluate your multimodal pipeline with RAG evaluation metrics to quantify gains from image and table retrieval.
- Build an agentic RAG system that routes queries to text, table, or image retrievers dynamically.
- Combine visual retrieval with GraphRAG for documents with complex entity-relationship structures.
- Scale your multimodal pipeline to production with caching, observability, and cost optimization.